When writing data, it is common to import a large batch at once, such as loading an entire CSV file into a table or ingesting data received through an API. A transaction is typically used to make sure everything is imported correctly.
For example, creating one record at a time:
2.times do |time|
  Post.create(
    title: "Post test #{time}",
    content: "mock content #{time}",
    is_available: true,
    user_id: 1
  )
end
TRANSACTION (0.2ms) BEGIN
User Load (0.2ms) SELECT "users".* FROM "users" WHERE "users"."id" = $1 LIMIT $2 [["id", 1], ["LIMIT", 1]]
Post Create (0.3ms) INSERT INTO "posts" ("title", "content", "is_available", "user_id", "created_at", "updated_at") VALUES ($1, $2, $3, $4, $5, $6) RETURNING "id" [["title", "Post test 0"], ["content", "mock content 0"], ["is_available", true], ["user_id", 1], ["created_at", "2021-10-02 08:40:31.090718"], ["updated_at", "2021-10-02 08:40:31.090718"]]
TRANSACTION (0.6ms) COMMIT
TRANSACTION (0.1ms) BEGIN
User Load (0.2ms) SELECT "users".* FROM "users" WHERE "users"."id" = $1 LIMIT $2 [["id", 1], ["LIMIT", 1]]
Post Create (0.3ms) INSERT INTO "posts" ("title", "content", "is_available", "user_id", "created_at", "updated_at") VALUES ($1, $2, $3, $4, $5, $6) RETURNING "id" [["title", "Post test 1"], ["content", "mock content 1"], ["is_available", true], ["user_id", 1], ["created_at", "2021-10-02 08:40:31.093480"], ["updated_at", "2021-10-02 08:40:31.093480"]]
TRANSACTION (0.2ms) COMMIT
Written this way, every record gets its own transaction. We can wrap the loop in a single transaction instead, so only one transaction is issued:
Post.transaction do
  2.times do |time|
    Post.create(
      title: "Post test #{time}",
      content: "mock content #{time}",
      is_available: true,
      user_id: 1
    )
  end
end
TRANSACTION (0.1ms) BEGIN
User Load (0.2ms) SELECT "users".* FROM "users" WHERE "users"."id" = $1 LIMIT $2 [["id", 1], ["LIMIT", 1]]
Post Create (0.3ms) INSERT INTO "posts" ("title", "content", "is_available", "user_id", "created_at", "updated_at") VALUES ($1, $2, $3, $4, $5, $6) RETURNING "id" [["title", "Post test 0"], ["content", "mock content 0"], ["is_available", true], ["user_id", 1], ["created_at", "2021-10-02 08:44:53.350348"], ["updated_at", "2021-10-02 08:44:53.350348"]]
User Load (0.1ms) SELECT "users".* FROM "users" WHERE "users"."id" = $1 LIMIT $2 [["id", 1], ["LIMIT", 1]]
Post Create (0.3ms) INSERT INTO "posts" ("title", "content", "is_available", "user_id", "created_at", "updated_at") VALUES ($1, $2, $3, $4, $5, $6) RETURNING "id" [["title", "Post test 1"], ["content", "mock content 1"], ["is_available", true], ["user_id", 1], ["created_at", "2021-10-02 08:44:53.352100"], ["updated_at", "2021-10-02 08:44:53.352100"]]
TRANSACTION (0.6ms) COMMIT
However, once the dataset grows, SQL still issues one INSERT per row (an N+1 of INSERTs), and the long-running transaction may lock the table, forcing other sessions to queue behind it.
This is where the activerecord-import gem comes in:
# Gemfile
gem "activerecord-import"
values =
  2.times.each_with_object([]) do |time, array|
    array << {
      title: "Post test #{time}",
      content: "mock content #{time}",
      is_available: true,
      user_id: 1
    }
  end
# values => [
# {:title=>"Post test 0", :content=>"mock content 0", :is_available=>true, :user_id=>1},
# {:title=>"Post test 1", :content=>"mock content 1", :is_available=>true, :user_id=>1}
# ]
Post.import values, validate: true
Post Create Many (3.4ms) INSERT INTO "posts" ("title","content","is_available","user_id","created_at","updated_at") VALUES ('Post test 0','mock content 0',TRUE,1,'2021-10-02 08:57:48.839905','2021-10-02 08:57:48.839905'),('Post test 1','mock content 1',TRUE,1,'2021-10-02 08:57:48.839905','2021-10-02 08:57:48.839905') RETURNING "id"
=> #<struct ActiveRecord::Import::Result failed_instances=[], num_inserts=1, ids=[19, 20], results=[]>
Notice there is now only a single INSERT. The validate option controls whether ActiveRecord validations run; when omitted, it defaults to true.
Note: importing without validations is roughly 5x faster than importing with them.
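Coming back to the CSV scenario from the introduction: a CSV export can be parsed into the same array-of-hashes shape and handed to Post.import in one call. The file contents below are made up for illustration, and validate: false is only safe when the rows are already known to be valid:

```ruby
require "csv"

# Hypothetical CSV export whose headers match the posts columns used above.
csv_data = <<~CSV
  title,content,is_available,user_id
  Post test 0,mock content 0,true,1
  Post test 1,mock content 1,true,1
CSV

# Parse each row into a hash, coercing the string cells to real types.
values = CSV.parse(csv_data, headers: true).map do |row|
  {
    title: row["title"],
    content: row["content"],
    is_available: row["is_available"] == "true",
    user_id: Integer(row["user_id"])
  }
end
# values now has the same shape as the hand-built array above, ready for:
#   Post.import values, validate: false
```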
One more important scenario: the incoming rows are identical (or nearly identical) to existing ones, but you want a fresh write timestamp to mark them as new data, or you simply want to overwrite the original rows. Normally this would mean a delete + create, but the gem provides a Duplicate Key Update option, which translates to ON DUPLICATE KEY UPDATE on MySQL and ON CONFLICT DO UPDATE on PostgreSQL; see each database's documentation for the exact Duplicate Key Update semantics.
Note: a unique index must exist on the conflicting columns for duplicate rows to be upserted.
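As a sketch, assuming a unique index on posts.title (the column choice here is an assumption for illustration, not from the original post), the upsert variant uses the gem's on_duplicate_key_update option:

```ruby
# Assumes a unique index exists, e.g. in a migration:
#   add_index :posts, :title, unique: true
values = [
  { title: "Post test 0", content: "refreshed content 0", is_available: true, user_id: 1 },
  { title: "Post test 1", content: "refreshed content 1", is_available: true, user_id: 1 }
]

# PostgreSQL: INSERT ... ON CONFLICT ("title") DO UPDATE SET ...
# MySQL:      INSERT ... ON DUPLICATE KEY UPDATE ... (conflict_target is ignored)
upsert_options = {
  validate: true,
  on_duplicate_key_update: {
    conflict_target: [:title],          # PG only: which unique index to match on
    columns: [:content, :updated_at]    # columns to overwrite on conflict
  }
}

# Inside a Rails app this would run as a single statement:
#   Post.import values, **upsert_options
```

Rows whose title already exists are updated in place instead of inserted, so the delete + create dance is no longer needed.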